ABSTRACT

The memory consistency model is a fundamental system 
property characterizing a multiprocessor. The relative mer- 
its of strict versus relaxed memory models have been widely 
debated in terms of their impact on performance, hardware 
complexity and programmability. This paper adds a new 
dimension to this discussion: the impact of memory mod- 
els on software reliability. By allowing some instructions 
to reorder, weak memory models may expand the window 
between critical memory operations. This can increase the 
chance of an undesirable thread-interleaving, thus allowing 
an otherwise-unlikely concurrency bug to manifest. To ex- 
plore this phenomenon, we define and study a probabilistic 
model of shared-memory parallel programs that takes into 
account such reordering. We use this model to formally 
derive bounds on the vulnerability to concurrency bugs of
different memory models. Our results show that for 2 con- 
current threads, weaker memory models do indeed have a 
higher likelihood of allowing bugs. On the other hand, we 
show that as the number of parallel, buggy threads increases, 
the gap between the different memory models becomes pro- 
portionally insignificant, and thus the importance of using 
a strict memory model diminishes. 

Categories and Subject Descriptors 

F.1.2 [Computation by Abstract Devices]: Modes of 
Computation — parallelism and concurrency; G.3 [Proba- 
bility and Statistics]: Stochastic processes; B.3.4 [Mem- 
ory Structures]: Reliability, Testing, and Fault-Tolerance 

General Terms 

Theory, Reliability 

*This paper is a full version of an extended abstract that 
1. INTRODUCTION

A critically important property of a shared-memory multi- 
processor is its memory consistency model. There has been 
an enormous amount of work on this subject, both in in- 
dustry and academia. The memory consistency model de- 
scribes which values may be returned by a load operation 
in a parallel or multi-threaded program. The strongest and 
most intuitive model is Sequential Consistency (SC) [15]. SC
imposes two requirements on the execution of parallel pro- 
grams: first, all processors must see the same global order 
of memory operations, and second, the operations for a par- 
ticular processor must appear to execute in program order. 
This model is attractive for its high level of programmability, 
but the strict constraints on memory operation reordering 
rule out important optimizations such as access buffering, 
pipelining, or dynamic scheduling, which improve perfor- 
mance by hiding the latency of memory accesses. In order 
to enable these aggressive optimizations, a wide variety of 
relaxed memory models have been proposed. Relaxed mem- 
ory models allow the reordering of certain types of memory 
operations at the cost of increased programming complex- 
ity, since programmers need to explicitly encode reordering 
restrictions to ensure correctness. 

Historically, the vast literature on memory consistency 
models has discussed a three-way trade-off between perfor- 
mance, hardware complexity, and programmability. In this 
paper, we bring a new axis to this discussion: software re- 
liability. Software is inherently unreliable, and is arguably 
becoming less reliable with pervasive concurrency. Concur- 
rency bugs such as data races and deadlocks are extremely 
common in practice, and can cause unexpected failures in 
even production-level code. 

In this paper, we investigate to what extent relaxed mem- 
ory consistency models further contribute to the unreliabil- 
ity of parallel software by increasing the likelihood that con- 
currency bugs will manifest during an execution. For this 
purpose, we study a new probabilistic model for the instruc- 
tion reordering introduced by relaxed memory models, and 
analyze a canonical buggy program (specifically, an atom- 
icity violation [21 HI [TT]) with respect to this model. We 
compare three important memory consistency models: Sequential
Consistency, Weak Ordering, and Total Store Order. We derive
two interesting results for our model:

• We show that for 2 (or any small constant number of) 
parallel threads, the bug is indeed more likely to mani- 
fest under weaker memory models. This is intuitive and 
follows from the following high-level argument: A typical 
concurrency bug, such as a data race, can manifest only 
during a short window of time. The reordering of opera- 
tions caused by relaxed memory models may increase the 
size of this critical window, thus making the bug more 
likely to manifest. In the paper, we give precise bounds 
on this vulnerability of the three memory models. 

• On the other hand, we show that as the number of par- 
allel, buggy threads increases, the gap between the dif- 
ferent memory models shrinks in proportion to the risk 
for even the strongest memory model. This implies that 
as the number of parallel threads in the system increases, 
the importance of using a strict memory model dimin- 
ishes (with regard to the software reliability metric we 
study in this paper). 

Notice that the latter result could have far-reaching impli- 
cations on the choice of memory consistency models in future 
multi-core and massively parallel systems. Intuitively, one 
might expect that with more and more concurrent threads, 
stronger memory consistency models should be used in or- 
der to counter the generally increased likelihood of bugs. 
However, our results indicate that the opposite is the case: 
As the number of threads increases, the relative importance
of having stronger memory models reduces to a minimum. 
The underlying reason is that the larger number of threads 
causes the likelihood that bugs occur to increase much more 
quickly than what even the strictest memory model is able 
to contain. That is, the asymptotic growth fundamentally 
works against using strict memory models as we increase the 
number of threads. 

The technical content of our paper proceeds as follows. 
In Section 3 we introduce two distinct random processes,
each of which is a natural object of inquiry in isolation. By 
combining them — treating the output of the first process as 
the input to the second — we model the end-to-end behavior 
of program execution. This allows us to answer our central 
question: how does the probability that a canonical data 
race manifests vary across memory models and quantity of 
threads? 

The first process models the generation of a random program,
and the subsequent randomized reordering of instructions.
Specifically, in Section 4 we derive the probability
that a certain essential window of vulnerability between two
instructions widens. The second process enacts a random series
of shifts on a set of heterogeneous segments of the integer
line. We use the positions of these line segments to model
the interleaving of the vulnerable windows of the threads. In
Section 5 we estimate the probability that each of these
segments is shifted to mutually disjoint positions. Finally, the
two processes are combined together in Section 6 to derive
overall bounds on the probability of bug manifestation, first 
for two threads, then for a large number of threads. Due to 
lack of space, several proofs are omitted and deferred to the 
appendix. 



2. BACKGROUND & RELATED WORK 
2.1 Memory Consistency Models 

Memory models are a key aspect of the hardware/software 
interface in shared-memory multicore/multiprocessor sys- 
tems. They determine what values read memory operations 
are allowed to return by dictating how memory operations 
are allowed to be reordered, as well as when writes become 
visible to other processors. They have major implications 
on the performance, design complexity and programmabil- 
ity of multiprocessor systems and the programs that run on 
them. Common misunderstandings about memory models 
often lead to bugs that are very difficult to find and fix, 
and can also lead to major performance issues. There ex- 
ists a vast and rich line of literature on memory models (a 
good tutorial overview is presented in [1]). Most of the past 
work has focused on new memory models [11] [2] [13], hardware
implementations [10] [12] [7], memory models for popular
languages such as Java [18] and C++ [6], and compiler
optimizations [16] and their relative merits [1][5].

Relaxed memory models: The strongest memory model 
is Lamport's Sequential Consistency (SC) [15]. In order to 
enable important performance optimizations, a number of 
relaxed memory models have been proposed in the literature, 
with varying degrees of guarantees. One of the strongest 
examples is known as Total Store Order (TSO) [19]. In
TSO, loads may execute before stores that precede them 
in program order, as long as no data dependency is vio- 
lated. All other pairs of instructions must maintain strict 
program order. This model encapsulates the natural case in 
which stores are observed by remote processors in program 
order. Some stores may take extra time to be observed after 
their execution, but the local program is allowed to proceed. 
A similar, but slightly weaker consistency model is Partial 
Store Order (PSO) [19], which also allows the reordering
of stores with respect to each other as long as they access 
distinct memory locations. A significantly weaker consistency
model is Weak Ordering (WO) [8] [2]. The opposite
extreme from Sequential Consistency, WO allows any mem- 
ory operations to reorder with one another, as long as no 
data dependencies are violated. This model allows for an 
equal amount of optimization as a uniprocessor, but is also 
the most vulnerable to programmer error, since it requires 
explicit fences to prevent unwanted reorderings. Modern 
processors typically support relaxed models. For example, 
the x86 memory model [3] [14] supports a model similar to 
TSO and the IBM POWER architecture supports a form of 
WO. 

The above memory consistency models follow a pattern: 
they can be defined by a subset of the four ordered mem- 
ory operation pairs, specifying which pairs are allowed to 
reorder: For example, in the WO model, any two mem- 
ory operations are allowed to be reordered; in SC, no two 
memory operations are allowed to be reordered; and in the 
TSO model, no two memory operations are allowed to be 
reordered, except that loads can reorder before stores (see 
Table 1).

Note that since in this paper we analyze a concurrency 
bug involving multiple threads, we ignore store atomicity [5], 
which is tangential to our present analysis. Moreover, we 
do not currently handle fence operations explicitly,^1 which
are used to restrict reorderings and are typically used for
synchronization. For that reason, we do not consider models
such as Release Consistency (RC) [11], which differs mainly
in the types of fences supported. As we discuss in Section 7,
it will be interesting to extend our process to distinguish
such memory models.

^1 However, our shift process in Section 5 can be used to
simulate a behavior similar to that arising from the use of fences.

                          ST/ST   ST/LD   LD/LD   LD/ST
Sequential Consistency
Total Store Order                   X
Partial Store Order         X       X
Weak Ordering               X       X       X       X

Table 1: Important memory models. An "X" in column ST/LD
means that the ordering restriction from stores to later loads
can be relaxed, i.e., loads can complete before stores that
precede them in program order. With regard to our model in
Section 3.1.2, this means that a LD can settle past (swap
with) a preceding ST. Other columns are analogous.

2.2 Race Conditions 

A common type of bug in shared-memory multithreaded 
programming is a race condition, which occurs when cor- 
rectness depends on an assumption about the order in which 
instructions from two or more threads interleave. In partic- 
ular, an atomicity violation 'W occurs when the programmer 
assumes that multiple instructions will execute as an atomic 
unit, but fails to insert the proper synchronization. A re- 
cent study showed that atomicity violations are extremely 
common in "real world" programs [17]. Race conditions are
often difficult to identify due to nondeterminism: the program
may behave correctly on most runs and fail only for
specific thread interleavings.

A canonical example of an atomicity violation is as follows:

    Thread 1                  Thread 2
    1: int loc = x;           1: int loc = x;
    2: loc = loc + 1;         2: loc = loc + 1;
    3: x = loc;               3: x = loc;

Here x is a shared variable (with x = 0 initially) and loc
is local to each thread. Two threads simultaneously try to 
increment x by loading its value into a local variable, incre- 
menting that local variable, then storing the updated value 
back to x. The programmer's intent is that x = 2 after both 
threads finish executing. However, the program has a race 
condition that can result in the spurious outcome x = 1. For 
instance, suppose that the two threads interleave as follows: 
(1) Thread 1 executes Lines 1 and 2; (2) Thread 2 executes 
Lines 1 and 2; (3) Thread 1 executes Line 3; (4) Thread 2 
executes Line 3. This interleaving produces the final result 
x = 1. We say that the bug manifests because the result did
not match programmer intent. 
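
As an illustrative sketch (not part of the original analysis), the canonical example can be checked exhaustively by enumerating every interleaving of the two threads at the granularity of the three numbered lines. The helper names `run` and `all_schedules` are hypothetical, and the sketch assumes that each line executes atomically and in program order, i.e., a sequentially consistent execution.

```python
from itertools import combinations

# Each thread runs the three lines of the canonical bug on its own local `loc`.
STEPS_PER_THREAD = 3

def run(schedule):
    """Execute one interleaving; `schedule` is a tuple of thread ids (0 or 1)."""
    x = 0                  # shared variable, initially 0
    loc = [None, None]     # one local variable per thread
    pc = [0, 0]            # next line (0-based) each thread will execute
    for t in schedule:
        if pc[t] == 0:
            loc[t] = x             # Line 1: int loc = x;
        elif pc[t] == 1:
            loc[t] = loc[t] + 1    # Line 2: loc = loc + 1;
        else:
            x = loc[t]             # Line 3: x = loc;
        pc[t] += 1
    return x

def all_schedules():
    """Yield every interleaving that preserves each thread's program order."""
    total = 2 * STEPS_PER_THREAD
    for slots in combinations(range(total), STEPS_PER_THREAD):
        yield tuple(0 if i in slots else 1 for i in range(total))

# Group the schedules by the final value of x they produce.
outcomes = {}
for s in all_schedules():
    outcomes.setdefault(run(s), []).append(s)

# The interleaving described in the text: T1 lines 1-2, T2 lines 1-2, T1 line 3, T2 line 3.
example = (0, 0, 1, 1, 0, 1)
print(sorted(outcomes), run(example))
```

Under these assumptions, the intended result x = 2 arises only when one thread finishes Line 3 before the other executes Line 1; every other interleaving yields x = 1.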

The standard solution for race conditions like the exam- 
ple above is to protect the variable x with a lock. However, 
locking protocols can be extremely complicated in large pro- 
grams, and in practice, a concurrency bug may easily slip 
past even the most experienced programmers. Note that 
such bugs can manifest in any memory model, even Sequen- 
tial Consistency. 

3. MODEL 

Our goal is to study how the use of different memory 
models impacts the likelihood of an error occurring given 



a canonical atomicity violation. In this section, we describe 
a model that allows us to formally analyze these likelihoods. 
It is a probabilistic model of parallel program executions 
under memory models that may permit reordering. At a 
high level, we consider two or more threads which execute a 
simple program containing an atomicity bug. The program 
consists of basic memory operations (stores and loads). De- 
pending on the memory model under consideration, the op- 
erations in each thread are then independently reordered via 
a random process we call the settling process. Finally, we use 
a thread interleaving model — the shift model — to model the 
execution of the program by interleaving the instructions 
of different threads. The probability of the bug manifest- 
ing is determined by analyzing how the operations from the 
threads interleave. We show in this paper that, when exe- 
cuting two threads, this probability crucially depends on the 
underlying memory model. Yet, perhaps counter-intuitively, 
we show that as the number of threads grows larger, the rela- 
tive difference between the memory models becomes smaller 
and smaller. 

3.1 Program Model 

We first describe a process for modeling a typical, ran- 
domly reordered program. The process proceeds in two 
phases: program generation and program reordering. 

3.1.1 Program Generation 

We model an initial program based on the canonical atomicity
violation bug described in Section 2.2. The program is a
sequence $S$ of memory operations
$x_1, x_2, \ldots, x_m, x_{m+1}, x_{m+2}$, where each $x_i$
has type $T(x_i) \in \{LD, ST\}$. $x_{m+1}$ and $x_{m+2}$
are Lines 1 and 3 of the canonical bug, respectively. Since we
are only concerned with memory operations, we omit Line 2
(which accesses only the local variable loc), and we will use
the terms instruction and memory operation synonymously
in this paper. We assume for simplicity that only $x_{m+1}$
and $x_{m+2}$ access the same location.^2 We will call
$x_{m+1}$ the critical load and $x_{m+2}$ the critical store.
An initial program order $S_0$ starts with a random sequence
of $m$ independently distributed LD and ST operations;
$T(x_i) = ST$ with probability $p$ and LD with probability
$1 - p$. Furthermore, for convenience in the analysis, it will
be useful to approximate a very long program by letting
$m \to \infty$.

3.1.2 Instruction Reordering: The Settling Process 

Different memory models allow for different forms of in- 
struction reorderings. We model this relaxation of program 
order using a probabilistic settling process. This random 
process models instruction reordering by taking a (random) 
initial program order as input, and producing a reordering of 
that initial program. The settling process takes into account 
which kinds of reorderings are allowed by the memory con- 
sistency model under consideration, and generates a random 
program order that is allowed to occur given the kinds of
reorderings. In this section, we give an informal description
of the settling process; a formal definition is given in
Appendix A.2. Figure 1 presents a visualization of the settling
process.

Given an initial program order $S_0$, the settling process
proceeds in $m+2$ rounds. In the $r$th round, (1) the program
order $S_{r-1}$ from the end of the $(r-1)$st round is taken as
the input, and (2) the $r$th instruction is settled in this
program order, which (3) creates the new program order $S_r$.
The final output of the settling process is the program order
$S_{m+2}$ after settling the critical store $x_{m+2}$. Settling
the $r$th instruction in round $r$ of the process works as follows.
Instruction $x_r$ is recursively reordered (that is, swapped in
the current program order) with its preceding instruction
(initially, this is the instruction at position $r-1$), until a
reordering "fails," in which case $x_r$ remains at its current
position in the program order. A reordering always fails if
the memory consistency model does not allow two operations
of this type to be reordered. Otherwise, the reordering
succeeds with some fixed probability $s$, and fails with
probability $1-s$.^3 When a reordering fails, we move on to the
next round.

^2 If two instructions access the same location, they cannot
reorder, so this assumption simplifies our analysis.

Figure 1: An instantiation of the settling process under TSO. LDs repeatedly settle upward with probability
1/2. If they fail to settle, or encounter another LD, they stop permanently, and the next-lowest LD begins. The
black boxes represent the critical instructions. The grey outlines indicate the currently settling instruction.
The bottom four instructions in the final order form the critical window.

For ease of exposition, we will set both probabilities p 
(from program model) and s to be 1/2 in subsequent sec- 
tions. However, note that as long as s and p are constant, 
the key theorems and conclusions derived in this paper re- 
main fundamentally the same (though some of the numerical 
values change somewhat). 

Examples: In SC, no instructions are allowed to be reordered;
hence $S_{m+2} = S_0$. In WO, all types of reorderings
are allowed, so, starting from instruction 2 in the initial
program order, each instruction is settled using a series of swaps
with its preceding instructions, until with probability $1-s$
a swap fails. Then the next instruction is settled, and so
forth. TSO relaxes only the ST $\to$ LD ordering, which in
our model implies that a LD may reorder with a preceding
ST with probability $s$, but all other types of reorderings fail.
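
The settling process described above can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not the formal definition of Appendix A.2: the names `ALLOWED`, `settle`, and `estimate` are hypothetical, a finite program length `m` stands in for the limit of a very long program, and p = s = 1/2 by default.

```python
import random

# Permitted swaps per model, keyed as (type of preceding op, type of settling op).
# The labels and this table layout are assumptions made for the sketch.
ALLOWED = {
    "SC": set(),
    "TSO": {("ST", "LD")},  # a LD may settle past a preceding ST
    "WO": {("ST", "LD"), ("LD", "ST"), ("ST", "ST"), ("LD", "LD")},
}

def settle(m, model, p=0.5, s=0.5, rng=random):
    """Run one settling process; return the number of instructions that end up
    between the critical LD and the critical ST."""
    # Initial order S_0: m random operations, then the critical load and store.
    prog = [("ST" if rng.random() < p else "LD", False) for _ in range(m)]
    prog += [("LD", True), ("ST", True)]
    allowed = ALLOWED[model]
    # Round r settles the instruction that starts at index r; earlier rounds only
    # move instructions upward, so it is still at index r when its round begins.
    for r in range(1, len(prog)):
        pos = r
        while pos > 0:
            above, cur = prog[pos - 1], prog[pos]
            same_location = above[1] and cur[1]  # the critical pair never reorders
            if same_location or (above[0], cur[0]) not in allowed:
                break                            # reordering not permitted
            if rng.random() >= s:
                break                            # permitted, but this swap fails
            prog[pos - 1], prog[pos] = cur, above
            pos -= 1
    load = prog.index(("LD", True))
    store = prog.index(("ST", True))
    return store - load - 1

def estimate(model, gamma, trials=20000, m=40, seed=0):
    """Monte Carlo estimate of the probability of a window growth of gamma."""
    rng = random.Random(seed)
    hits = sum(settle(m, model, rng=rng) == gamma for _ in range(trials))
    return hits / trials
```

For example, `estimate("WO", 0)` approximates the probability that the critical window does not grow under Weak Ordering. Encoding each model as a set of permitted ordered pairs mirrors the way the memory models are distinguished in Table 1.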

We will represent the result of a settling process by a
permutation on the indices. For thread $k$,
$\pi^{(k)}(i) : [1, 2, \ldots, m+2] \to [1, 2, \ldots, m+2]$
maps the instruction starting at position $i$ to its final
settled position.

The settling process has two key features: (1) memory
model constraints are enforced (two operations can reorder
only if allowed by the memory model), and (2) reorderings
that are allowed occur with a fixed likelihood. One effect of
the latter property is that in the final program order, most
instructions will not move too far from their position in
the initial program order. The critical property of a memory
consistency model that we seek to capture is the degree to
which individual instructions can reorder beyond other
instructions, and thus move further away from their original
position.

^3 A more general form of the settling model allows different
nonzero probabilities for different kinds of reorderings,
depending on the types of memory operations involved. For
example $s_{LD,LD}$ can be different from $s_{LD,ST}$, even if
both are nonzero.

3.2 Thread Interleaving Model 

We describe a second high-level random process, which is 
used to determine the interleaving of n threads when they 
are executed simultaneously on a multiprocessor. In fact, 
the process is quite general, and may be of independent in- 
terest as a probabilistic model. We first describe it in the 
abstract, then discuss how it will be used to determine the 
effect of the program model's output on the probability of 
bug manifestation. 

Definition 1. Consider a sequence of $n$ positive line segments
originating at 0, having integer lengths
$\gamma = \gamma_1, \ldots, \gamma_n$. A shift process translates
the segments by i.i.d. geometric random variables
$s_1, \ldots, s_n$. Then the random event of interest, called
$A(\gamma)$, is the event that the segments are shifted
such that all are mutually disjoint. That is,

$$A(\gamma) := [s_i, s_i + \gamma_i] \cap [s_j, s_j + \gamma_j] = \emptyset \quad \forall i \neq j.$$

In Section 5 we will analyze the probability of $A(\gamma)$ for
arbitrary segment lengths $\gamma$. However, to connect this model
to the task at hand, we will go on to think of these segment
lengths as the critical windows of reordered programs
generated by the program model.

Recall that we study a canonical data race, for which correct
execution requires that each thread's pair of critical LD
and critical ST be executed atomically. We thus refer to
the sequence of instructions between the critical LD and ST
(inclusively) as the critical window of a thread. We let
$B^k_\gamma$ be the event that the final ordering of thread $T_k$
inserts $\gamma$ instructions between the critical LD and ST
(sometimes referred to as the critical window growth of a
memory model). Manifestation of the data race corresponds
exactly to the event that when the reordered threads are
executed in parallel, some pair of critical windows are not
executed disjointly. We let $A$ refer to the event that critical
windows are disjoint. One can then think of $\Pr[B^k_\gamma]$
and $\Pr[A]$ as the two fundamental values we seek to
characterize in this paper, each a measure of the vulnerability
of a memory model to this canonical data race.



The shift model is used to simulate the parallel execution 
of the critical windows of each thread, under the following 
assumptions. All threads are assumed to initially be identical
copies of a single program, generated randomly as in
Section 3.1.1. Each thread is then independently reordered
according to the process of Section 3.1.2. We then simulate
the parallel execution of the reordered threads by placing
the final instruction of each critical window at the origin of
the number line (here representing time in reverse, with 0
being the final time step of execution), and using the shift
model of Definition 1 to model the varying rates of execution of
each thread. After shifting, the execution of each instruc- 
tion is assumed to take one unit of time; instructions begin 
and end synchronously across all threads, in lock-step. We 
assume that instructions instantaneously read the current 
state of the system at the beginning of the time step, and 
instantaneously commit their changes at the end of the time 
step. In this way we ensure a clear semantics for the state 
of the system at any given time: when a LD executes, it 
observes all the effects of any ST that completed in a time 
step preceding it. 

We can now observe the circumstances in which a data 
race manifests. There must be two threads such that, sub- 
sequent to reordering, the final regions of time steps between 
the critical LD and ST (inclusive) overlap with one another. 
In this case the data race must manifest, because one of the 
LDs must observe a value after (or simultaneous to) the other 
LD being observed, but before the other ST has committed. 

A formal definition and a graphical visualization of the
shift process are in Appendix A.3 (see Figure 2).
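
The shift process of Definition 1 can be explored numerically with a short Monte Carlo sketch. The function names below are hypothetical, and the geometric distribution is assumed here to have success probability 1/2 and support {0, 1, 2, ...}; the formal parameters are those of Appendix A.3, which this sketch does not reproduce.

```python
import random

def geometric(rng, q=0.5):
    """Number of failures before the first success; support {0, 1, 2, ...}."""
    k = 0
    while rng.random() >= q:
        k += 1
    return k

def disjoint(intervals):
    """True if no two closed integer intervals (lo, hi) share a point."""
    ordered = sorted(intervals)
    # After sorting by start, checking neighbouring pairs is sufficient.
    return all(a_hi < b_lo
               for (a_lo, a_hi), (b_lo, b_hi) in zip(ordered, ordered[1:]))

def estimate_disjoint(lengths, trials=100000, q=0.5, seed=0):
    """Estimate the probability that all shifted segments are mutually disjoint."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        shifted = []
        for g in lengths:
            s = geometric(rng, q)        # independent shift for this segment
            shifted.append((s, s + g))   # segment [s, s + g]
        hits += disjoint(shifted)
    return hits / trials
```

Under the assumed parameter, two segments of length 1 are disjoint exactly when their shifts differ by at least 2, which happens with probability 1/3; `estimate_disjoint([1, 1])` should be close to that value, and longer segments make disjointness less likely.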

4. THE CRITICAL WINDOW 

In this section, we study what is perhaps the core com- 
ponent of our random process, and the only one that di- 
rectly distinguishes the memory models: the reordering of 
instructions within an individual thread. In particular, we 
are interested in the final distribution of the size of the crit- 
ical window between the critical LD and ST. For the ex- 
treme memory models of Sequential Consistency and Weak 
Ordering, we are easily able to exactly characterize this dis- 
tribution. The bulk of the technical challenge of this section 
(and consequently of later sections) is in establishing results 
for the more subtle model, Total Store Order. By carefully
conditioning on several auxiliary random variables, lower 
bounding complex algebraic terms by their low-indexed val- 
ues, and utilizing a bound on the partition number of certain 
integers, we derive rather sharp approximations for the dis- 
tribution of the critical window size. These bounds will in 
subsequent sections be plugged into derived formulae for the 
probability of bug manifestation, as a function of the thread 
interleaving process. Though the results in this section are 
tailored specifically to the thread generation and reordering 
processes specified in the previous section, it is worthwhile 
to observe how the asymptotics of the overall bug manifes- 
tation probability will not depend delicately on the details 
of this process. 

We will be estimating the critical window growth,
$\Pr[B^k_\gamma]$, for a select set of memory models. Recall that
$B^k_\gamma$ is the event that the thread $T_k$ inserts $\gamma$
instructions between the critical LD and ST in reordering.
Because we will be considering a single fixed thread in this
subsection, we will refer to the event $B^k_\gamma$ by
$B_\gamma$, and the permutation $\pi^{(k)}$ by $\pi$. The
first two memory models can be considered a warmup for
the substantially more challenging case of Total Store Order.
All of these results are captured in the following theorem.

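Theorem 4.1. The critical window growth behaves according
to the following functions:

• Sequential Consistency:

$$\Pr[B_\gamma] = \begin{cases} 1 & \text{if } \gamma = 0, \\ 0 & \text{if } \gamma > 0. \end{cases}$$

• Weak Ordering:

$$\Pr[B_\gamma] = \begin{cases} 2/3 & \text{if } \gamma = 0, \\ (2^{-\gamma})/3 & \text{if } \gamma > 0. \end{cases}$$

• Total Store Order:

$$\Pr[B_\gamma] = \begin{cases} 2/3 & \text{if } \gamma = 0, \\ (6/7) \cdot 4^{-\gamma} + R(\gamma) \cdot 2^{-\gamma} & \text{if } \gamma > 0, \end{cases}$$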



for non-negative approximation term Rij) < ^ . 

Observe that the critical window grows at vastly different
rates across the models. Up to lower-order terms, the
probability of a window size $\gamma$ is $2^{-\gamma}$ in Weak
Ordering, $(2^{-\gamma})^2$ in Total Store Order, and 0 in
Sequential Consistency. It remains to be seen in later sections
the extent to which this window size affects bug manifestation.

Proof (Theorem 4.1, Sequential Consistency).
Under sequential consistency, no instruction is ever allowed
to reorder. Hence $\Pr[B_0] = 1$, and $\Pr[B_\gamma] = 0$
$\forall \gamma \neq 0$. □

We next consider the case of intermediate difficulty: Weak 
Ordering. 

Proof (Theorem 4.1, Weak Ordering).
Under weak ordering, all four ordered pairs of instruction
types are allowed to pass one another. Recall that we assume
a strong normal form, in which all possible swaps occur
with probability 1/2. Hence in weak ordering, each subsequent
instruction continually moves up with probability 1/2,
until it ever fails to swap. This applies to the critical load
and critical store as well, with the exception that the critical
store will never pass the critical load (because they access
the same address). To calculate the probability, we condition
on the resting position of the critical LD, which entails
a given resting position for the critical ST, for any
$\gamma > 0$.

$$\Pr[B_\gamma] = \Pr[\pi(m+2) - \pi(m+1) = \gamma + 1]$$

$$= \sum_{i=\gamma}^{\infty} \Pr[\pi(m+1) = m+1-i] \cdot \Pr[\pi(m+2) = m+2-i+\gamma \mid \pi(m+1) = m+1-i]$$

$$= \sum_{i=\gamma}^{\infty} 2^{-(i+1)} \, 2^{-(i+1-\gamma)} = \frac{2^{-\gamma}}{3}.$$

We must handle the case of $\gamma = 0$ separately, because here
the critical ST stops moving "automatically," when it runs
up against the critical LD.

$$\Pr[B_0] = \sum_{i=0}^{\infty} \Pr[\pi(m+1) = m+1-i] \cdot \Pr[\pi(m+2) = m+2-i \mid \pi(m+1) = m+1-i]$$

$$= \sum_{i=0}^{\infty} 2^{-(i+1)} \, 2^{-i} = 2/3. \quad □$$
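
The two Weak Ordering series can be checked numerically. The following sketch (hypothetical function names, swap probability fixed at 1/2 as in the proof) evaluates truncated partial sums with exact rational arithmetic and compares them with the closed forms of Theorem 4.1.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def wo_window_prob(gamma, terms=200):
    """Partial sum of the Weak Ordering series for a window growth of gamma."""
    total = Fraction(0)
    for i in range(gamma, gamma + terms):
        load_moves = HALF ** (i + 1)            # critical LD rises i slots, then fails
        if gamma == 0:
            store_moves = HALF ** i             # i swaps, then blocked by the critical LD
        else:
            store_moves = HALF ** (i + 1 - gamma)  # i - gamma swaps, then one failure
        total += load_moves * store_moves
    return total

def wo_closed_form(gamma):
    """Closed form stated in Theorem 4.1 for Weak Ordering."""
    return Fraction(2, 3) if gamma == 0 else Fraction(1, 3) * HALF ** gamma
```

Because the terms decay geometrically, 200 terms already agree with the closed form to far more digits than any practical use requires, and the closed forms sum to 1 over all window sizes.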

Finally we turn to the far more challenging setting of Total 
Store Order. 

Proof (Theorem 4.1, Total Store Order).
One of the strongest and most commonly used relaxed mem- 
ory models, Total Store Order (TSO) only permits loads to 
swap with stores. Hence in calculating the distribution of 
window size, we need only consider the number of stores 
located directly before the critical load. Those stores will 
never move themselves, and the critical load can never swap 
past the first load above it. Moreover, the critical store never 
swaps with anything, so its final position is fixed. 

However, deriving bounds on $\Pr[B_\gamma]$ is difficult. LD
operations may reorder past ST operations, thus pushing longer
sequences of ST operations together. In this section we de- 
rive bounds on the critical window growth for TSO, which 
is a core technical contribution of this paper. The proof is 
quite involved. Much difficulty arises in gaining control over 
the relative positions of LDs and STs. We outline the steps 
taken to estimate the critical window growth below. The 
majority of these steps are non-trivial, and often involve
delicate case analyses.

Proof Outline. 

1. Express the critical window probability in terms of a series
of new random variables, $L_\mu$: the event that the
second-to-last reordering leaves exactly $\mu$ contiguous STs
above the critical LD.

2. To calculate the probability of $L_\mu$, condition on the value
of another series of random variables, $\Psi_\mu$: the number
of LDs initially between the critical LD and the $(\mu+1)$th
lowest ST.

3. Express the $\Psi_\mu$-conditioned probability of $L_\mu$ in
terms of the limit of the fraction of STs near the bottom of a
reordered thread, and another probability,
$\Pr[F_\mu \mid \Psi_\mu = q]$: the chance of $q$ LDs all
reordering out of a region of at least $\mu$ STs.

4. To estimate $\Pr[F_\mu \mid \Psi_\mu = q]$, condition on a new
random variable, A: the sum, over STs, of the number of LDs
below each ST. Express the probability of A in terms
of the weighted sum of several integer partition numbers,
and estimate these by a simple lower bound.

5. After combining the above elements to bound the probability
of $L_\mu$, lower bound an ugly term of this expression
by its value at $\mu = 1$, checking via the derivative that
this term is increasing in $\mu$.

6. Use the lower bound on the probability of $L_\mu$ to finally
lower bound the probability of a given window size. To
achieve an upper bound, calculate the total probability
not attributed to some $L_\mu$ in the lower bound, and
attribute it to the worst-possible case.

We now move on to execute this plan in detail. 



Step 1 (number of contiguous STs above the critical LD): Recall that $S_0$ ($S_{m+2}$) denotes the initial (final) instruction order, and that $S_m$ refers to the instruction order just before the critical load is settled. For convenience, we define the following basic random events. Let $S_{LD,i}(j)$ be the event that the $j$th instruction of $S_i$ is a LD. Furthermore, we define $S_{LD,i}(j,k) = \bigwedge_{\ell=j}^{k} S_{LD,i}(\ell)$ as the event that the entire contiguous range from $j$ to $k$ in $S_i$ consists of LDs. $S_{ST,i}(j)$ and $S_{ST,i}(j,k)$ are defined accordingly. 

For $\mu \in \mathbb{N}$, we define $L_\mu$ as the event that in $S_m$, there are exactly $\mu$ ST operations immediately preceding the critical LD. In other words, 

\[ L_\mu = S_{LD,m}(m-\mu) \wedge S_{ST,m}(m-\mu+1,\, m). \]

The critical LD may only move $\gamma$ positions if there are at least $\gamma$ contiguous ST operations above it. Hence for any $\gamma$, we have 

\[ \Pr[B_\gamma] = \sum_{\mu=\gamma}^{\infty} \Pr[B_\gamma \mid L_\mu] \cdot \Pr[L_\mu]. \]

Deriving $\Pr[B_\gamma \mid L_\mu]$ is straightforward. If $\mu = \gamma$, we have $\Pr[B_\gamma \mid L_\gamma] = 2^{-\gamma}$, as the critical LD must pass all $\gamma$ STs. After that, it stops because the next instruction is a LD. For $\mu > \gamma$, we have $\Pr[B_\gamma \mid L_\mu] = 2^{-(\gamma+1)}$, because the instruction above the $\gamma$th ST is also a ST. Hence there is only a 1/2 probability of the reordering completing when it reaches that point. 

It remains to derive bounds for $\Pr[L_\mu]$ for all $\mu$. This is the primary technical lemma of the proof. 

Lemma 4.2. For any $\mu > 0$, $\Pr[L_\mu] \ge \frac{4}{7} \cdot 2^{-\mu}$. Moreover, $\Pr[L_0] = 1/3$ exactly. 

Proof. We will approach this lemma by asking (1) how many LDs are interspersed among the first $\mu$ STs above the critical LD, and (2) what is the probability that all of those LDs settle such that we are left with $\mu$ contiguous STs above the critical LD. Because STs cannot settle past LDs in this model, nothing happens during rounds in which a ST can move; the technical difficulty arises in the motion of the LDs. 

Step 2 (number of interspersed LDs): In the initial program order $S_0$, let $\Phi_\mu$ refer to the position of the $\mu$th-lowest non-critical ST. Formally, 

\[ \Phi_\mu = \min\{ i : |\{ j > i : S_{ST,0}(j) \}| = \mu + 1 \}. \]

Furthermore, let $\Psi_\mu$ refer to the number of LD operations above the critical LD but below the $\mu$th-lowest non-critical ST. That is, 

\[ \Psi_\mu = m + 1 - \mu - \Phi_\mu. \]

Note that as the program length goes to infinity, the probability that such a $\Phi_\mu$ and $\Psi_\mu$ exist goes to 1. Now we can express $\Pr[L_\mu]$ as 

\[ \Pr[L_\mu] = \sum_{q=0}^{\infty} \Pr[L_\mu \mid \Psi_\mu = q] \cdot \Pr[\Psi_\mu = q]. \qquad (1) \]

We have $\Pr[\Psi_\mu = q] = 2^{-\mu} 2^{-q} \binom{\mu+q-1}{q}$ because there are $\binom{\mu+q-1}{q}$ ways to build a string of $\mu$ STs and $q$ LDs such that the top instruction is a ST. 

Step 3 (probability of interspersed LDs settling out): The difficult part of bounding $\Pr[L_\mu]$ is $\Pr[L_\mu \mid \Psi_\mu = q]$. This is the probability that 

(A) All $q$ LDs between the ST at $\Phi_\mu$ and the critical LD settle up until they pass the ST at $\Phi_\mu$, 

(B) but do not settle so far that the settled instruction above the ST at $\Phi_\mu$ is another ST. 

(B) is due to the fact that $L_\mu$ specifies that there be exactly $\mu$ STs above the critical LD. The probability of (B) relies on the instruction directly above $\Phi_\mu$ in $S_{\Phi_\mu - 1}$. If it is a LD, then (B) holds automatically, since all the LDs must stop settling. However, if it is a ST, then (B) only holds if not all of the $q$ LDs that have passed the ST at $\Phi_\mu$ also pass the next-highest ST. Hence this is the first property on which we condition. 

\[ \Pr[L_\mu \mid \Psi_\mu = q] = \Pr[L_\mu \wedge S_{LD,\Phi_\mu-1}(\Phi_\mu - 1) \mid \Psi_\mu = q] + \Pr[L_\mu \wedge S_{ST,\Phi_\mu-1}(\Phi_\mu - 1) \mid \Psi_\mu = q]. \]

By Bayes' Law, 

\[ \Pr[L_\mu \wedge S_{LD,\Phi_\mu-1}(\Phi_\mu - 1) \mid \Psi_\mu = q] = \Pr[S_{LD,\Phi_\mu-1}(\Phi_\mu - 1) \mid \Psi_\mu = q] \cdot \Pr[L_\mu \mid S_{LD,\Phi_\mu-1}(\Phi_\mu - 1) \wedge \Psi_\mu = q]. \]

We first consider the latter term. Because the final instruction that settles above $\Phi_\mu$ will be a LD under these conditions, this depends only on the bottom $\mu$ instructions settled above the critical LD being STs. For shorthand, let 

\[ F_\mu = S_{ST,m}(m - \mu + 1,\, m). \]

Then 

\[ \Pr[L_\mu \mid S_{LD,\Phi_\mu-1}(\Phi_\mu - 1) \wedge \Psi_\mu = q] = \Pr[F_\mu \mid \Psi_\mu = q]. \]

In contrast, for $L_\mu$ to hold given $S_{ST,\Phi_\mu-1}(\Phi_\mu - 1)$, it does not suffice for the $q$ LDs to move past $\Phi_\mu$. They must also not all settle past the next highest instruction. All $q$ of them settle past it with probability $2^{-q}$. Hence 

\[ \Pr[L_\mu \mid S_{ST,\Phi_\mu-1}(\Phi_\mu - 1) \wedge \Psi_\mu = q] = \Pr[F_\mu \mid \Psi_\mu = q] \cdot (1 - 2^{-q}). \]

Putting these expressions together, we find that 

\[ \Pr[L_\mu \mid \Psi_\mu = q] = \Pr[F_\mu \mid \Psi_\mu = q] \cdot \Pr[S_{LD,\Phi_\mu-1}(\Phi_\mu - 1)] + \Pr[F_\mu \mid \Psi_\mu = q] \cdot \Pr[S_{ST,\Phi_\mu-1}(\Phi_\mu - 1)] \cdot (1 - 2^{-q}) \]
\[ = \Pr[F_\mu \mid \Psi_\mu = q] \cdot \bigl(1 - 2^{-q} \cdot \Pr[S_{ST,\Phi_\mu-1}(\Phi_\mu - 1)]\bigr), \]

where the last step uses $\Pr[S_{LD,\Phi_\mu-1}(\Phi_\mu - 1)] = 1 - \Pr[S_{ST,\Phi_\mu-1}(\Phi_\mu - 1)]$. 

We first derive an exact value for $\Pr[S_{ST,i}(i)]$. Though it is difficult to determine the probability that a given instruction is a ST in general, this particular value can be derived exactly through a recurrence relation. 

Claim 4.3. 

\[ \lim_{i \to \infty} \Pr[S_{ST,i}(i)] = 2/3. \]

Proof. After reordering stage $i$, instruction $i$ can be a ST in one of two ways. Either it can initially be a ST (in which case it never reorders), or it can initially be a LD, the instruction above it can be settled as a ST, and the two can swap. Hence 

\[ \Pr[S_{ST,i}(i)] = \frac{1}{2} + \frac{1}{2} \cdot \Pr[S_{ST,i-1}(i-1)] \cdot \frac{1}{2}. \]

This is a recurrence relation of the form $X_i = b + aX_{i-1}$, which has the solution $X_i = \frac{b}{1-a} + a^{i-1}\bigl(X_1 - \frac{b}{1-a}\bigr)$. Plugging in $X_1 = 1/2$, $a = 1/4$, $b = 1/2$, we find 

\[ \Pr[S_{ST,i}(i)] = \frac{1/2}{1 - 1/4} + (1/4)^{i-1}\Bigl(1/2 - \frac{1/2}{1 - 1/4}\Bigr) = 2/3 + (1/4)^{i-1}(1/2 - 2/3). \]

The resulting probability is a function of $i$, but we are interested in the steady state as the size of the program goes to infinity. Hence the second term falls out. 

\[ \lim_{i \to \infty} \Pr[S_{ST,i}(i)] = 2/3. \qquad \Box \]
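
The recurrence in Claim 4.3 is easy to check numerically. The following is a minimal sketch, not part of the original analysis; the function names are hypothetical, and exact rational arithmetic is used so that the iterated recurrence can be compared against the closed form without rounding.

```python
from fractions import Fraction

def st_probability(i):
    """Iterate X_i = 1/2 + (1/4) * X_{i-1} starting from X_1 = 1/2."""
    x = Fraction(1, 2)
    for _ in range(i - 1):
        x = Fraction(1, 2) + Fraction(1, 4) * x
    return x

def st_probability_closed_form(i):
    """Closed form from the proof: 2/3 + (1/4)^(i-1) * (1/2 - 2/3)."""
    return Fraction(2, 3) + Fraction(1, 4) ** (i - 1) * (Fraction(1, 2) - Fraction(2, 3))

if __name__ == "__main__":
    for i in (1, 2, 3, 10):
        print(i, st_probability(i), float(st_probability(i)))
```

The gap to 2/3 shrinks by a factor of 4 per round, which is why the analysis can safely use the limiting value near the bottom of a long program.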

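Now that we know the typical fraction of instructions near the bottom of the program that are STs after reordering, we can derive a bound on $\Pr[F_\mu \mid \Psi_\mu = q]$. 

Step 4 (estimating $\Pr[F_\mu \mid \Psi_\mu = q]$): 

Claim 4.4. 

\[ \Pr[F_\mu \mid \Psi_\mu = q] \ge \binom{\mu+q-1}{q}^{-1} \bigl( 2^{-(q-1)} - 2^{-\mu q} \bigr). \]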



Proof. Everything in this proof is implicitly conditioned on the event $\Psi_\mu = q$. Let the random variable 

\[ \Delta = \sum_{\Phi_\mu < i \le m \,:\, S_{LD,0}(i)} \bigl| \{ \Phi_\mu \le j < i : S_{ST,0}(j) \} \bigr| \]

represent the total number of positions that LDs from $\Phi_\mu$ to $m$ must move up, in order to leave a sequence of $\mu$ STs immediately above the critical LD. It must be that $\Delta \ge q$, because at least instruction $\Phi_\mu$ is a ST, and $\Delta \le \mu q$, because no LD can be required to pass more than $\mu$ STs. With this definition, we may write $\Pr[F_\mu \mid \Psi_\mu = q] = \sum_{\delta=q}^{\mu q} \Pr[\Delta = \delta] \cdot 2^{-\delta}$. The exact value of $\Pr[\Delta = \delta]$ can be stated formally, but not in a closed form. Namely, let $\phi(x, y, z)$ be the number of distinct multi-sets of $y$ positive integers summing to $x$, such that each integer is at most $z$. This is a variant on the much-studied partition number of $x$. Then $\phi(\delta, q, \mu)$ is exactly the number of arrangements of $q$ LDs and $\mu$ STs (beginning with a ST) such that $\delta$ is the sum of the number of STs above each of the LDs. (For each LD, we simply select how many STs to place it below; the relative order of the LDs is immaterial.) There are $\binom{\mu+q-1}{q}$ total arrangements of LDs and STs beginning with a ST. Hence 

\[ \Pr[\Delta = \delta] = \binom{\mu+q-1}{q}^{-1} \phi(\delta, q, \mu) \]

and 

\[ \Pr[F_\mu \mid \Psi_\mu = q] = \binom{\mu+q-1}{q}^{-1} \sum_{\delta=q}^{\mu q} \phi(\delta, q, \mu) \cdot 2^{-\delta}. \]

Simple forms for $\phi(x, y, z)$ are not known. Asymptotic results exist, but are not helpful here because the terms with small parameters have the largest contributions. However, to achieve a good bound it suffices to show that $\phi(\delta, q, \mu) \ge 1$ when $q \le \delta \le \mu q$. To show that a partition exists that achieves any number in this range, consider the following construction. Set $\delta \bmod q$ of the integers to $\lceil \delta/q \rceil$, and set the rest of the integers to $\lfloor \delta/q \rfloor$. We can set the integers this large, because $\delta/q \le (\mu q)/q = \mu$. Then the chosen integers sum to $(\delta \bmod q) \lceil \delta/q \rceil + (q - (\delta \bmod q)) \lfloor \delta/q \rfloor$, which can be shown to be exactly $\delta$. Hence we may write 

\[ \Pr[F_\mu \mid \Psi_\mu = q] \ge \binom{\mu+q-1}{q}^{-1} \sum_{\delta=q}^{\mu q} 2^{-\delta} = \binom{\mu+q-1}{q}^{-1} \bigl( 2^{-(q-1)} - 2^{-\mu q} \bigr). \qquad \Box \]
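
The two combinatorial facts used in the proof of Claim 4.4 can be checked by brute force for small parameters. This is an illustrative sketch only: the helper names `phi` and `balanced_partition` are hypothetical, and `phi` enumerates multisets directly rather than using any partition-number formula.

```python
from itertools import combinations_with_replacement
from math import comb

def phi(delta, q, mu):
    """Count multisets of q integers, each in [1, mu], that sum to delta."""
    return sum(
        1
        for parts in combinations_with_replacement(range(1, mu + 1), q)
        if sum(parts) == delta
    )

def balanced_partition(delta, q):
    """Construction from the proof: (delta mod q) copies of ceil(delta/q),
    and the remaining entries equal to floor(delta/q)."""
    r = delta % q
    return [-(-delta // q)] * r + [delta // q] * (q - r)

mu, q = 4, 3
counts = {delta: phi(delta, q, mu) for delta in range(q, mu * q + 1)}

if __name__ == "__main__":
    print(counts)
    print(sum(counts.values()), comb(mu + q - 1, q))
```

Summing `phi` over the whole range recovers the binomial coefficient that normalizes the distribution of the total displacement, and every value in the range is hit at least once, which is all the lower bound needs.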



Having derived a bound for $\Pr[F_\mu \mid \Psi_\mu = q]$, we are now in a position to conclude the proof of Lemma 4.2. First note that $\Pr[L_0] = 1/3$, by Claim 4.3. For values of $\mu$ greater than 0, Claim 4.4 will be the central tool in the proof, which is left to Appendix B.1. □ 

The remainder of the proof of Theorem 4.1, steps 5 and 6, is deferred to Appendix B.1. □ 

5. SHIFT PROCESS 

Here we discuss the next component of our analysis: a "shift process" meant to capture the interleaving of reordered threads. We refer the reader back to the definition in Section 3.2. This process is where the critical windows derived from the reordering process come into effect. 

In the analysis that follows, we assume that each critical window's shift is distributed geometrically, representing the intuition that threads are exponentially less likely to execute at progressively increasing offsets from one another. Let $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_n) \in \mathbb{N}^n$ be a sequence of integral "segment lengths." In subsequent sections, $\gamma_k$ will be used to represent the length of the critical window of thread $T_k$. We define a shift process on $\gamma$ as follows. Consider $n$ segments of the line, of lengths $\gamma_1, \gamma_2, \ldots, \gamma_n$, and let the starting point of each segment be shifted up from 0 by an i.i.d. nonnegative random variable $s_i$. We are interested in the probability that the resulting set of shifted segments is non-overlapping. In other words, we would like to bound $\Pr[A(\gamma)]$, where $A(\gamma)$ is the event that for all $i \ne j \in \{1, 2, \ldots, n\}$, we have $[s_i, s_i + \gamma_i] \cap [s_j, s_j + \gamma_j] = \emptyset$. 

The following theorem states this probability precisely, 
and as such is not particularly enlightening on its own. How- 
ever, when the segment lengths are random variables drawn 
from a well-understood distribution (as they are in the case 
of reordered random threads), we will be able to state the 
probability concisely. 

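Theorem 5.1. 

\[ \Pr[A(\gamma)] = \sum_{\sigma \in Sym_n} \prod_{i=1}^{n-1} \frac{2^{-(n+1-i) - (n-i)\gamma_{\sigma(i)}}}{1 - 2^{-(n+1-i)}}, \]

where $Sym_n$ is the symmetric group of degree $n$: the set of all permutations on $n$ elements. 

The following corollary simplifies this expression: 

Corollary 5.2. For some $c(n) \in [2,4]$, 

\[ \Pr[A(\gamma)] = c(n) \cdot 2^{-\binom{n+1}{2}} \sum_{\sigma \in Sym_n} \prod_{i=1}^{n-1} 2^{-(n-i)\gamma_{\sigma(i)}}. \]

In particular, $c(2) = 8/3$ exactly. 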

The proof of the corollary is in Appendix B.2. We now turn to the proof of the main theorem. The challenge is to characterize the probability that the next segment is shifted to a position disjoint from all previous segments. At first glance, it is difficult to handle the huge and diverse set of legal placements for a set of segments. Our key insight is to condition on the relative order of the magnitude of the shifts. We then consider the probability that each segment is disjoint from the previous segments in this order. In so doing, we are able to exploit the memorylessness of the geometric distribution. Let $t$ be an arbitrary segment, and $t'$ be the segment immediately preceding it in this order. To understand the distribution of disjoint placements for $t$, we need only know the distribution of the origin of $t'$. Then by assuming that the segments are disjoint, we can infer that the origin of $t$ is distributed according to the origin of $t'$, plus the length of $t'$, plus an independent geometric random variable. 



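Proof (Theorem 5.1). Let $s_i$ be a geometric random variable with parameter 1/2 (i.e., $s_i = k$ with probability $2^{-(k+1)}$ for all $k \in \mathbb{N}$, counting from $k = 0$). In order to analyze the probability of $A(\gamma)$, we will take the following steps. We will first condition on the ordering of the segments. Then for a given ordering, we will use the memorylessness of the shift variables to calculate the probability of each successive segment being disjoint from each previous. 

For a permutation $\sigma$ on $\{1, 2, \ldots, n\}$, let $Y_\sigma$ be the event that for all $i$, the $i$th smallest shift occurs on segment $\sigma(i)$. That is, $s_{\sigma(1)} < s_{\sigma(2)} < \cdots < s_{\sigma(n)}$. (Ties can be ignored, since two segments with equal shifts necessarily overlap.) Then 

\[ \Pr[A(\gamma)] = \sum_{\sigma \in Sym_n} \Pr[A(\gamma) \wedge Y_\sigma]. \]

We now analyze $\Pr[A(\gamma) \wedge Y_\sigma]$. We will refer to this event by $A(\gamma, \sigma)$. For all segments to be disjoint, it must be the case that each segment begins after the end of every segment that began before it. $\sigma$ captures exactly the order in which segments begin. So disjointness means that for all $i < j$, segment $\sigma(j)$ begins after the end of segment $\sigma(i)$. Hence for each $i$, we may condition on the shift of the segment with the $i$th smallest shift, and consider the probability that each segment with a larger shift follows its completion. 

\[ \Pr[A(\gamma, \sigma)] = \sum_{\ell=0}^{\infty} \Pr[A(\gamma, \sigma) \wedge s_{\sigma(1)} = \ell] \]
\[ = \sum_{\ell=0}^{\infty} \Pr\Bigl[A(\gamma, \sigma) \wedge s_{\sigma(1)} = \ell \wedge \bigwedge_{i=2}^{n} s_{\sigma(i)} > \ell + \gamma_{\sigma(1)}\Bigr] \]
\[ = \sum_{\ell=0}^{\infty} \Pr\Bigl[A(\gamma, \sigma) \Bigm| s_{\sigma(1)} = \ell \wedge \bigwedge_{i=2}^{n} s_{\sigma(i)} > \ell + \gamma_{\sigma(1)}\Bigr] \cdot \Pr[s_{\sigma(1)} = \ell] \cdot \prod_{i=2}^{n} \Pr[s_{\sigma(i)} > \ell + \gamma_{\sigma(1)}]. \]

The third equality is due to the independence of the shift variables. Let $\gamma^i$ refer to the restriction of $\gamma$ to the segment indices with the $n - i + 1$ largest shifts (i.e., $\gamma^i = \gamma|_{[n] \setminus \bigcup_{j<i} \sigma(j)}$). Similarly, let $\sigma^i$ refer to the restriction of $\sigma$ to the $n - i + 1$ largest shifts (i.e., $\sigma^i = \sigma|_{[n] \setminus [i-1]}$). We define these structures so that we can express the disjointness event in terms of a new disjointness event on a smaller set of unconditioned segments. In particular, let $A(\gamma^i, \sigma^i)$ be the disjointness event for an independent random shift process on segments $\sigma(i), \sigma(i+1), \ldots, \sigma(n)$, with permutation $\sigma^i$ pointing to the new indices of these segments. We will see that we are permitted to condition on such a prior event, because of the memorylessness of the shift variables. 

Conditioned on the first segment being disjoint from all the following segments, we need only consider the event $A(\gamma^{i+1}, \sigma^{i+1})$. Then due to the memorylessness of the shifts, we have 

\[ \Pr\Bigl[A(\gamma^i, \sigma^i) \Bigm| s_{\sigma^i(1)} = \ell \wedge \bigwedge_{j=2}^{n-i+1} s_{\sigma^i(j)} > \ell + \gamma_{\sigma^i(1)}\Bigr] \]
\[ = \Pr\Bigl[A(\gamma^{i+1}, \sigma^{i+1}) \Bigm| \bigwedge_{j=2}^{n-i+1} s_{\sigma^i(j)} > \ell + \gamma_{\sigma^i(1)}\Bigr] \]
\[ = \Pr\Bigl[A(\gamma^{i+1}, \sigma^{i+1}) \Bigm| \bigwedge_{j=2}^{n-i+1} s_{\sigma^i(j)} \ge 0\Bigr] = \Pr[A(\gamma^{i+1}, \sigma^{i+1})]. \]

We now observe a simple recurrence relation that defines $\Pr[A(\gamma^i, \sigma^i)]$: 

\[ \Pr[A(\gamma^i, \sigma^i)] = \sum_{\ell=0}^{\infty} \Pr[A(\gamma^{i+1}, \sigma^{i+1})] \cdot \Pr[s_{\sigma^i(1)} = \ell] \cdot \prod_{j=2}^{n-i+1} \Pr[s_{\sigma^i(j)} > \ell + \gamma_{\sigma^i(1)}] \]
\[ = \sum_{\ell=0}^{\infty} 2^{-(\ell+1)} \cdot 2^{-(n-i)(\ell + \gamma_{\sigma^i(1)} + 1)} \cdot \Pr[A(\gamma^{i+1}, \sigma^{i+1})] \]
\[ = \frac{2^{-(n+1-i) - (n-i)\gamma_{\sigma^i(1)}}}{1 - 2^{-(n+1-i)}} \cdot \Pr[A(\gamma^{i+1}, \sigma^{i+1})]. \]

Moreover, it is clear that $\Pr[A(\gamma^n, \sigma^n)] = 1$. Then noting that $\sigma^i(1) = \sigma(i)$, the solution is trivial: 

\[ \Pr[A(\gamma^1, \sigma^1)] = \prod_{i=1}^{n-1} \frac{2^{-(n+1-i) - (n-i)\gamma_{\sigma(i)}}}{1 - 2^{-(n+1-i)}}. \]

Finally, plugging these terms into the overall probability of disjointness yields the expression in the theorem. We will use this expression in the next section to calculate the probability of bug manifestation. □ 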
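
As a sanity check on the shift process, the product formula derived in the proof can be compared against a direct enumeration of shifts. The sketch below is illustrative only: the function names and the truncation parameter `max_shift` are assumptions introduced here, and the closed form is the one reconstructed from the recurrence above (a sum over orderings of a product with exponent (n+1-i) + (n-i)·γ and denominator 1 - 2^-(n+1-i)).

```python
from fractions import Fraction
from itertools import permutations, product

def disjoint_prob_formula(gammas):
    """Sum over all orderings sigma of the per-ordering product from the proof."""
    n = len(gammas)
    total = Fraction(0)
    for sigma in permutations(range(n)):
        term = Fraction(1)
        for i in range(1, n):  # i = 1, ..., n-1
            g = gammas[sigma[i - 1]]
            term *= Fraction(1, 2 ** ((n + 1 - i) + (n - i) * g))
            term /= 1 - Fraction(1, 2 ** (n + 1 - i))
        total += term
    return total

def disjoint_prob_bruteforce(gammas, max_shift=30):
    """Enumerate shifts 0..max_shift, each with Pr[s = k] = 2^-(k+1), and add up
    the probability of every shift vector whose closed segments are disjoint."""
    n = len(gammas)
    total = 0.0
    for shifts in product(range(max_shift + 1), repeat=n):
        segs = sorted((s, s + g) for s, g in zip(shifts, gammas))
        # Segments sorted by start are pairwise disjoint iff each one ends
        # strictly before the next one starts.
        if all(segs[k][1] < segs[k + 1][0] for k in range(n - 1)):
            p = 1.0
            for s in shifts:
                p *= 2.0 ** -(s + 1)
            total += p
    return total

if __name__ == "__main__":
    print(disjoint_prob_formula([2, 2]))
    print(float(disjoint_prob_formula([3, 2, 5])), disjoint_prob_bruteforce([3, 2, 5], 25))
```

The enumeration ignores shifts above `max_shift`, so it slightly undercounts; the missing mass is at most n·2^-(max_shift+1), which is negligible for the values used here.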

6. JOINING THE MODELS 

We have now described the two fundamental random pro- 
cesses of our work. Though the two are interesting in isola- 
tion, it is by combining them that we will achieve our overall 
goal: to characterize the probability of the canonical data 
race manifesting, under various memory models. 

Our first observation is to note that Corollary 5.2 can be further simplified, provided the segment lengths are drawn from a distribution with a very weak condition. 

Theorem 6.1. Let $\Gamma = \Gamma_1, \ldots, \Gamma_n$ be a distribution over segment lengths, drawn from $\mathbb{N}^n$. Assume that the marginal distribution of each segment length is identical (i.e., $\Gamma_i \sim \Gamma_j$ for all $i \ne j$); they needn't be independent. Then all permutations of segment shifts are equivalent, and 

\[ \Pr[A(\Gamma)] = c(n) \cdot 2^{-\binom{n+1}{2}} \cdot n! \cdot E_\Gamma\Bigl[\prod_{i=1}^{n-1} 2^{-(n-i)\Gamma_i}\Bigr]. \]



The proof is given in Appendix B.3. Because the identicality condition holds for the critical window size, the theorem gives an indication of how it is that we can analyze the overall bug manifestation concretely. Recall that the process of Section 4 generates a uniformly random program of STs and LDs, then randomly "settles" each instruction in turn, according to the rules of the memory model. The process of Section 5 applies a random "shift" to a series of line segments, the key event for which is the mutual disjointness of all the segments. We now combine these two processes by letting the line segment lengths of the shift process be distributed as the critical window size of the settling process. An important subtlety is that we generate a single initial random program, then independently reorder $n$ copies of this program. Though this makes the analysis more complex, it adds a degree of realism: with $n$ identical threads, it is more natural that the same data race would be present in the same position of every pair of threads. The following two theorems summarize our key results. 

Theorem 6.2. For $n = 2$ threads, the probability that the canonical data race does not manifest is the following, in each of the three main models. 

Sequential Consistency:   $\Pr[A] \approx 0.1666$ 
Total Store Order:¹       $0.1369 > \Pr[A] > 0.1315$ 
Weak Ordering:            $\Pr[A] \approx 0.1296$ 

Theorem 6.3. As $n$ grows, the probability of successful execution is identical in all models, up to lower order terms in the exponent. In particular, $\Pr[A] = 2^{-n^2(3/2 + o(1))}$. 



The first tightly bounds the probability of successful execution for the case of $n = 2$ threads; the second gives an asymptotic bound on this probability for large $n$. We leave the proofs of these theorems to Appendix B.3. Both proofs are rather technical and build upon the theorems of the previous two sections. The only surprising observation necessary is that, when lower bounding a certain expectation over the critical window for $n$ threads, it suffices to use only a single term of this expectation. Doing so achieves the asymptotic behavior we seek. 

Key Observations: Interpreting Theorems 6.2 and 6.3 yields remarkable insights. Though the case of $n = 2$ substantively distinguishes the memory models, we find that as $n$ grows, the probability in all memory models approaches the same value, up to lower order terms in the exponent. This dichotomy is a fundamental take-away for informing computer architecture decisions. Though the use of weaker memory models does increase the risk of program error, as the number of threads grows this additional risk becomes negligible compared to the growth in the risk of error under even sequential consistency. This is of particular importance given the trend towards ever larger multicores that enable more and more concurrent threads. 

7. DISCUSSION 

Limitations and possible extensions: Our analysis assumes that the program consists solely of loads and stores, when real programs include synchronization, arithmetic, etc. These instructions can affect the timing of the program, introduce data dependencies that limit reordering, or disallow certain types of reorderings. An important item for future work is to include acquire/release fences, which are necessary to simulate memory models such as Release Consistency [11]. These fences act as one-way barriers, allowing instructions to reorder into, but not out of, a critical section. This behavior can be easily modeled using settling (§3.1.2). Fences make concurrency bugs less likely to manifest, as programs with fences have fewer legal reorderings. However, we conjecture that adding fences will not significantly change the main conclusions derived in this paper. 

¹ A very similar analysis achieves a similar result for Partial Store Order (PSO). We omit the result for brevity. 

Optimized implementations of SC: Our model of Se- 
quential Consistency assumes a relatively simple implemen- 
tation wherein each processor executes only one memory 
instruction at a time. Many SC implementations use ag- 
gressive optimizations such as speculative execution to com- 
pete with the performance of weaker memory models I10U121 
[3- We do not consider this simplifying assumption to be a 
weakness of our model; rather, we believe our results about 
weak memory models can be extended to address optimizing 
implementations of strong memory models. In other words, 
concurrency bugs are more likely to manifest in an imple- 
mentation of SC that uses aggressive reordering than in a 
simple (and slow) implementation. 

Generality of Results: In this paper, we propose and 
study one specific probabilistic process to model program 
execution and thread interleaving. Clearly, there are other 
plausible models that can be studied. Our intuition is that 
the results in this paper have a certain robustness with re- 
gard to changes to the parameters in our models as well as 
to changes in the model. However, future work is required 
to formally validate this conjecture. 

8. CONCLUSION 

With the ubiquity of multicore systems and the trend towards integrating ever more cores on a single chip, multiprocessor programmability has become one of the key challenges in computer science. Even with improvements in programmability, we are likely to see an increase in software defects, given the inherent difficulty of concurrent programming. Memory consistency models are at the center of the programmability discussion, since they determine the memory access semantics of parallel programs. The debate over memory models has historically revolved around the tradeoffs between programmability, performance and complexity. In this paper we bring a new axis to this discussion: software reliability. We study an analytical model and show that concurrency bugs are indeed more likely to manifest themselves in relaxed memory models, but surprisingly, that as the number of parallel threads increases, the difference between strong and weak memory models diminishes. The latter observation can have important consequences for system designers when developing new memory models. 
APPENDIX 

A. MODEL DEFINITION 
A.1 Initial Program Order 

The initial program order $S_0$ consists of $m + 2$ instructions: 

\[ x_1, x_2, \ldots, x_m, \text{LD } X, \text{ST } X \]

where for $1 \le i \le m$, $x_i$ has type $T(x_i) = \text{ST}$ with probability $p$, and type $T(x_i) = \text{LD}$ with probability $(1 - p)$. Each $x_i$ accesses a location $X_i$ such that $X_i = X_j$ only if $i = j$, and $X_i \ne X$. For the purposes of defining the model we assume that $m$ is finite, but in the analysis it is useful to approximate a very long program by letting $m \to \infty$. 

A.2 Definition of Settling Process 

We model instruction reordering as a random process consisting of $m + 2$ rounds. This process produces a permutation of $S_0$ which we call $S_{m+2}$; round $i$ produces the intermediate permutation $S_i$. During round $i$, instruction $x_i$ is inserted into the permuted ordering of instructions $x_1$ through $x_{i-1}$. We decide where to insert instruction $x_i$ by repeatedly swapping $x_i$ with the instruction directly before it. Each swap succeeds with probability $\rho_{\tau_1,\tau_2}$, where $\tau_1$ is the type of the instruction directly before $x_i$'s current location and $\tau_2$ is the instruction type of $x_i$, and fails with probability $1 - \rho_{\tau_1,\tau_2}$. $\rho_{\tau_1,\tau_2}$ is always either 0 or $s$, depending on the memory model. The single exception is for the critical LD and ST, $x_{m+1}$ and $x_{m+2}$. If $x_{m+2}$ ever tries to swap with $x_{m+1}$, it automatically fails, because they access the same memory address. The round completes when a swap fails or $x_i$ reaches position 1. This recursive random process is called settling. 

Let $\pi_i(j)$ be a function from positions in $S_0$ to positions in $S_i$. (Note that $\pi_0(j) = j$ for all $j$.) We formally define the insertion point of instruction $x_i$ using the probability distribution $\beta_i$. 

Definition 2. Given the intermediate permutation $S_{i-1}$, we define a probability distribution $\beta_{i,k}$ as follows: 

• If $k = 1$, $\beta_{i,k} = 1$. 

• Else let $j = \pi_{i-1}^{-1}(k - 1)$ and let $q = \rho_{T(x_j),T(x_i)}$. Then $\beta_{i,k}$ is 

  - $k$ with probability $1 - q$; 

  - a draw from $\beta_{i,k-1}$ with probability $q$. 

We also define $\beta_i$ to be $\beta_{i,i}$. 

$\beta_i$ describes the distribution of possible positions for instruction $i$ after round $i$ of settling. $\beta_{i,k}$ describes the distribution of the possible positions for instruction $i$ given that $i$ moves up at least as high as position $k$. 

The result of round $i$ is the permutation $\pi_i$, in which the instructions following $x_i$'s new location are each pushed down by one, and the instructions preceding $x_i$'s new location do not move. 

Definition 3. Recall that $\pi_i$ is a function mapping positions in $S_0$ to $S_i$. Given permutation $S_{i-1}$, we draw $k$ from $\beta_i$ and construct the permutation $S_i$ as follows: 

\[ \pi_i(i) = k \]
\[ \pi_i(j) = \pi_{i-1}(j) \quad \text{for } \pi_{i-1}(j) < k \]
\[ \pi_i(j) = \pi_{i-1}(j) + 1 \quad \text{for } \pi_{i-1}(j) \ge k \]

We use Definitions 2 and 3 to get a probability distribution over permutations of $S_0$. We refer to the final permutation $\pi_{m+2}$ as $\pi$. 
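
The settling process can be simulated directly. The following is a minimal sketch under stated assumptions: the swap table `TSO_LIKE` is an assumption read off the proof of Lemma 4.2 (only a LD may move above a ST, with probability s = 1/2), the function names are hypothetical, and the critical LD/ST exception is not modeled because stores never move under this table.

```python
import random

# Assumed swap probabilities, keyed by (type directly above, type being settled).
# Any pair not listed has swap probability 0.
TSO_LIKE = {("ST", "LD"): 0.5}

def settle(program, rho, rng):
    """Run one settling round per instruction and return the final order."""
    order = []
    for instr in program:
        pos = len(order)  # the new instruction starts below everything settled so far
        order.append(instr)
        while pos > 0:
            above = order[pos - 1]
            if rng.random() >= rho.get((above, instr), 0.0):
                break  # the swap failed, so the round ends
            order[pos - 1], order[pos] = order[pos], order[pos - 1]
            pos -= 1
    return order

def last_is_st_fraction(trials=20000, length=30, seed=0):
    """Estimate the probability that the bottom settled instruction is a ST."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        program = [rng.choice(("ST", "LD")) for _ in range(length)]
        hits += settle(program, TSO_LIKE, rng)[-1] == "ST"
    return hits / trials

if __name__ == "__main__":
    print(last_is_st_fraction())
```

With an empty swap table the program order is returned unchanged, which corresponds to the simple Sequential Consistency implementation discussed in Section 7. Under the assumed table, the estimate should land near the limiting value of Claim 4.3.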

A.3 Definition of Interleaving Model 

Formally, the thread interleaving model is defined as follows. Let threads $T_1, \ldots, T_n$ be $n$ identical threads, distributed as described in A.1. 

We allow each initially-identical thread to reorder independently, using the settling process of A.2. We refer to the final permutation $\pi_{m+2}$ of thread $T_k$ as $\pi^{(k)}$. Define the "critical window" $W_k$ for a reordered thread $T_k$ to be the set of indices (inclusively) between the settled positions of the critical instructions. E.g., 

\[ W_k = \{\pi^{(k)}(m+1),\ \pi^{(k)}(m+1)+1,\ \ldots,\ \pi^{(k)}(m+2)-1,\ \pi^{(k)}(m+2)\}. \]

Finally, we independently allow each thread to "shift up" with respect to one another. 

For each $k$, we allow thread $T_k$ to shift exactly $i$ positions up with probability $2^{-(i+1)}$. Observe that $\sum_{i=0}^{\infty} 2^{-(i+1)} = 1$, so as $m \to \infty$, this gives a probability distribution over the positions of each instruction in each thread. Let $\eta_k$ be the shift of $T_k$. 

We then say that the bug manifests if there exist $k \ne \ell$ such that the critical windows of reordered $T_k$ and $T_\ell$ overlap whatsoever. In other words, define the bug non-manifestation event $A$ by 

\[ A = \neg \exists\, k \ne \ell : \bigl(\pi^{(\ell)}(m+2) - \eta_\ell \le \pi^{(k)}(m+2) - \eta_k\bigr) \wedge \bigl(\pi^{(\ell)}(m+2) - \eta_\ell \ge \pi^{(k)}(m+1) - \eta_k\bigr). \]

(Note that for any overlapping pair of ranges, the bottom of one of the two windows will necessarily overlap with the other window.) 

Expressed alternately, let $\widetilde{W}_k$ be the shifted window $\{\pi^{(k)}(m+1) - \eta_k,\ \pi^{(k)}(m+1) + 1 - \eta_k,\ \ldots,\ \pi^{(k)}(m+2) - 1 - \eta_k,\ \pi^{(k)}(m+2) - \eta_k\}$. Then 

\[ A = \neg \exists\, k \ne \ell : \widetilde{W}_k \cap \widetilde{W}_\ell \ne \emptyset. \]

B. PROOFS 

B.1 Proofs for Theorem 4.1: Critical Window Growth 

In this section, we finish up the proofs for Lemma 4.2 and the Total Store Order case of Theorem 4.1. 

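Proof (Remainder of Lemma 4.2). Combining Equation (1) with the expression for $\Pr[L_\mu \mid \Psi_\mu = q]$ from Step 3, the bound of Claim 4.4, and the limiting value $\Pr[S_{ST,\Phi_\mu-1}(\Phi_\mu - 1)] = 2/3$ from Claim 4.3, 

\[ \Pr[L_\mu] = \sum_{q=0}^{\infty} \Pr[L_\mu \mid \Psi_\mu = q] \cdot \Pr[\Psi_\mu = q] \]
\[ \ge \sum_{q=0}^{\infty} 2^{-\mu} 2^{-q} \binom{\mu+q-1}{q} \cdot \binom{\mu+q-1}{q}^{-1} \bigl(2^{-(q-1)} - 2^{-\mu q}\bigr) \cdot \bigl(1 - 2^{-q} \cdot \Pr[S_{ST,\Phi_\mu-1}(\Phi_\mu - 1)]\bigr) \]
\[ = 2^{-\mu} \sum_{q=0}^{\infty} \bigl(2 \cdot 4^{-q} - 2^{-(\mu+1)q}\bigr) \bigl(1 - \tfrac{2}{3} \cdot 2^{-q}\bigr) \]
\[ = 2^{-\mu} \Bigl( \frac{8}{7} - \frac{1}{1 - 2^{-(\mu+1)}} + \frac{2}{3} \cdot \frac{1}{1 - 2^{-(\mu+2)}} \Bigr). \]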

Figure 2: An instantiation of the shift process. Three segments of lengths $\gamma = (\gamma_1, \gamma_2, \gamma_3) = (3, 2, 5)$ are independently shifted. A particular triple of shifts $(s_1, s_2, s_3)$ occurs with probability $2^{-(s_1+1)} \cdot 2^{-(s_2+1)} \cdot 2^{-(s_3+1)}$. The disjointness event $A(\gamma)$ does indeed hold here. 
The above expression is difficult to work with, so we give a simpler lower bound that holds for all $\mu \ge 1$. Let $h(\mu)$ be the parenthesized expression above (such that the bound is $\Pr[L_\mu] \ge 2^{-\mu} \cdot h(\mu)$). 

We differentiate $h(\mu)$ and show that it is increasing, and compute a small value explicitly, so that we may lower bound all higher values by this value. 

We will show that for all $\mu \ge 1$, 

\[ h(\mu) \ge 4/7. \]

We first note that 

\[ h(1) = \frac{8}{7} - \frac{1}{1 - 1/4} + \frac{2}{3} \cdot \frac{1}{1 - 1/8} = 8/7 - 4/3 + 16/21 = 4/7. \]

We now show that $h(\mu)$ is increasing in $\mu$. The function $h(\mu)$ is defined as $h(\mu) = \frac{8}{7} - (1 - 2^{-(\mu+1)})^{-1} + \frac{2}{3}(1 - 2^{-(\mu+2)})^{-1}$. To see that $h(\mu)$ is increasing, we differentiate w.r.t. $\mu$. 

\[ \frac{d}{d\mu} h(\mu) = \frac{2^{-(\mu+1)} (\ln 2)}{(1 - 2^{-(\mu+1)})^2} - \frac{2}{3} \cdot \frac{2^{-(\mu+2)} (\ln 2)}{(1 - 2^{-(\mu+2)})^2}. \]

This expression is positive when 

\[ \frac{1}{(1 - 2^{-(\mu+1)})^2} > \frac{1}{3 (1 - 2^{-(\mu+2)})^2}, \]

i.e., $\frac{1 - 2^{-(\mu+2)}}{1 - 2^{-(\mu+1)}} > \sqrt{1/3}$. This holds for all $\mu \ge 0$, since the left-hand side exceeds 1. Hence $h(\mu)$ is increasing for non-negative $\mu$. □ 

Proof (Remainder of Theorem 4.1). It will be useful to calculate the total slack in our probability bounds. Let $\Pr_\ell[L_\mu]$ be the lower bound for $\Pr[L_\mu]$ computed in this section, and $\Pr_r[L_\mu] = \Pr[L_\mu] - \Pr_\ell[L_\mu]$ be the remainder. The "missing" probability $R = \sum_{\mu=0}^{\infty} \Pr_r[L_\mu]$ is computed below. 

Claim B.1. 

\[ R = 2/21. \]

Proof. Note that $\Pr_r[L_\mu]$ is nonnegative for all $\mu$, because $\Pr_\ell[L_\mu]$ is a lower bound. $L_\mu$ must hold for exactly one $\mu$, hence $\sum_{\mu=0}^{\infty} \Pr[L_\mu] = 1$. 

\[ R = \sum_{\mu=0}^{\infty} \Pr[L_\mu] - \sum_{\mu=0}^{\infty} \Pr_\ell[L_\mu] = 1 - \Pr[L_0] - \sum_{\mu=1}^{\infty} \frac{4}{7} \cdot 2^{-\mu} = 1 - 1/3 - 4/7 \cdot 1 = 2/3 - 4/7 = 2/21. \qquad \Box \]

Now, using the derivation of $\Pr[L_0]$ from Lemma 4.2, 

\[ \Pr[B_0] = \Pr[B_0 \mid L_0] \cdot \Pr[L_0] + \Pr[B_0 \mid \neg L_0] \cdot \Pr[\neg L_0] = 1 \cdot (1/3) + (1/2) \cdot (2/3) = 2/3. \]

Moving on to the case of $\gamma \ge 1$, we rewrite $\Pr[B_\gamma]$ as 

\[ \Pr[B_\gamma] = \sum_{\mu=\gamma}^{\infty} \Pr[B_\gamma \mid L_\mu] \cdot \Pr_\ell[L_\mu] + \sum_{\mu=\gamma}^{\infty} \Pr[B_\gamma \mid L_\mu] \cdot \Pr_r[L_\mu]. \]

We compute the first sum exactly, and provide upper and lower bounds for the second sum, in order to upper and lower bound $\Pr[B_\gamma]$. 

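First we compute the value of $\sum_{\mu=\gamma}^{\infty} \Pr[B_\gamma \mid L_\mu] \cdot \Pr_\ell[L_\mu]$ exactly: 

\[ \sum_{\mu=\gamma}^{\infty} \Pr[B_\gamma \mid L_\mu] \cdot \Pr_\ell[L_\mu] = \Pr[B_\gamma \mid L_\gamma] \cdot \Pr_\ell[L_\gamma] + \sum_{\mu=\gamma+1}^{\infty} \Pr[B_\gamma \mid L_\mu] \cdot \Pr_\ell[L_\mu] \]
\[ = 2^{-\gamma} h(1) 2^{-\gamma} + \sum_{\mu=\gamma+1}^{\infty} 2^{-\gamma} (1/2)\, h(1)\, 2^{-\mu} \]
\[ = h(1)\, 2^{-\gamma} \Bigl( 2^{-\gamma} + \sum_{\mu=\gamma+1}^{\infty} (1/2)\, 2^{-\mu} \Bigr) \]
\[ = h(1)\, 2^{-\gamma} \Bigl( 2^{-\gamma} + (1/2) \frac{2^{-(\gamma+1)}}{1/2} \Bigr) \]
\[ = h(1)\, 2^{-\gamma} \cdot 3 \cdot 2^{-(\gamma+1)} = 3 h(1)\, 2^{-(2\gamma+1)} = \frac{6}{7} \cdot 4^{-\gamma}. \]

Next we upper bound $\sum_{\mu=\gamma}^{\infty} \Pr[B_\gamma \mid L_\mu] \cdot \Pr_r[L_\mu]$: 

\[ \sum_{\mu=\gamma}^{\infty} \Pr[B_\gamma \mid L_\mu] \cdot \Pr_r[L_\mu] = 2^{-\gamma} \cdot \Pr_r[L_\gamma] + \sum_{\mu=\gamma+1}^{\infty} 2^{-\gamma} (1/2) \cdot \Pr_r[L_\mu] = 2^{-\gamma} \cdot \Bigl( \Pr_r[L_\gamma] + (1/2) \sum_{\mu=\gamma+1}^{\infty} \Pr_r[L_\mu] \Bigr). \]

To upper bound the above expression, observe that 

\[ \sum_{\mu=\gamma}^{\infty} \Pr_r[L_\mu] \le R. \]

It is clear that allocating all of this probability mass to $\Pr_r[L_\gamma]$ maximizes the above expression: 

\[ \sum_{\mu=\gamma}^{\infty} \Pr[B_\gamma \mid L_\mu] \cdot \Pr_r[L_\mu] \le 2^{-\gamma} \cdot \Bigl( R + (1/2) \sum_{\mu=\gamma+1}^{\infty} 0 \Bigr) = \frac{2}{21} \cdot 2^{-\gamma}. \]

We cannot ensure that $\sum_{\mu=\gamma}^{\infty} \Pr_r[L_\mu]$ is positive for $\gamma > 0$, because all of $R$ could be allocated to $\Pr_r[L_0]$. Hence the best lower bound we can expect here is $\sum_{\mu=\gamma}^{\infty} \Pr[B_\gamma \mid L_\mu] \cdot \Pr_r[L_\mu] \ge 0$. □ 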
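
The constants appearing in the two proofs above can be checked with exact arithmetic. This is a small sketch, assuming that h has the form 8/7 - 1/(1 - 2^-(μ+1)) + (2/3)/(1 - 2^-(μ+2)), which is the form consistent with the stated value h(1) = 8/7 - 4/3 + 16/21; the names `h`, `R` and `first_sum` are introduced here for illustration.

```python
from fractions import Fraction

def h(mu):
    """Assumed form: h(mu) = 8/7 - 1/(1 - 2^-(mu+1)) + (2/3)/(1 - 2^-(mu+2))."""
    return (
        Fraction(8, 7)
        - 1 / (1 - Fraction(1, 2 ** (mu + 1)))
        + Fraction(2, 3) / (1 - Fraction(1, 2 ** (mu + 2)))
    )

# Slack left over by the lower bound; the geometric series sum_{mu>=1} 2^-mu equals 1.
R = 1 - Fraction(1, 3) - h(1) * 1

def first_sum(gamma):
    """Exactly computed part of Pr[B_gamma]: 3 * h(1) * 2^-(2*gamma + 1)."""
    return 3 * h(1) * Fraction(1, 2 ** (2 * gamma + 1))

values = [h(mu) for mu in range(1, 12)]

if __name__ == "__main__":
    print(h(1), R, first_sum(1))
    print([float(v) for v in values])
```

Under this assumed form, h rises from 4/7 towards 8/7 - 1 + 2/3 = 17/21, so bounding every h(μ) by h(1) is what leaves the slack R.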

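B.2 Shift Model Proofs 

Proof (Corollary 5.2). Factoring the terms of Theorem 5.1 that do not depend on $\sigma$ out of the sum gives $c(n) = 2 / \prod_{i=1}^{n-1} (1 - 2^{-(n+1-i)})$, so $c(n) \ge 2$ is immediate. For general $n$, it therefore suffices to show that $\prod_{i=1}^{n-1} (1 - 2^{-(n+1-i)}) \ge 1/2$. 

\[ \prod_{i=1}^{n-1} \bigl(1 - 2^{-(n+1-i)}\bigr) = \prod_{i=2}^{n} \bigl(1 - 2^{-i}\bigr) = \frac{1}{\prod_{i=2}^{n} \Bigl(1 + \frac{2^{-i}}{1 - 2^{-i}}\Bigr)} \]
\[ \ge \frac{1}{\prod_{i=2}^{n} \exp\Bigl(\frac{2^{-i}}{1 - 2^{-i}}\Bigr)} \ge \exp\Bigl(-\frac{4}{3} \sum_{i=2}^{\infty} 2^{-i}\Bigr) = \exp(-2/3) > \frac{1}{2}. \]

To check the value of $c(2)$, we simply plug in $n = 2$. That is, 

\[ c(2) = \frac{2}{1 - 2^{-(2+1-1)}} = \frac{8}{3}. \qquad \Box \]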

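B.3 Proofs of Final Theorems 

Proof (Theorem 6.1). Recall from Corollary 5.2 that 

\[ \Pr[A(\gamma)] = c(n) \cdot 2^{-\binom{n+1}{2}} \sum_{\sigma \in Sym_n} \prod_{i=1}^{n-1} 2^{-(n-i)\gamma_{\sigma(i)}}. \]

Our goal is to average over the summation of permutations. Since we are treating $\Gamma$ as a random variable, 

\[ \Pr[A(\Gamma)] = E_\Gamma\Bigl[ c(n) \cdot 2^{-\binom{n+1}{2}} \sum_{\sigma \in Sym_n} \prod_{i=1}^{n-1} 2^{-(n-i)\Gamma_{\sigma(i)}} \Bigr] = c(n) \cdot 2^{-\binom{n+1}{2}} \sum_{\sigma \in Sym_n} E_\Gamma\Bigl[ \prod_{i=1}^{n-1} 2^{-(n-i)\Gamma_{\sigma(i)}} \Bigr]. \]

Then, for each fixed $\sigma$, 

\[ E_\Gamma\Bigl[ \prod_{i=1}^{n-1} 2^{-(n-i)\Gamma_{\sigma(i)}} \Bigr] = \sum_{\Gamma} \prod_{i=1}^{n-1} 2^{-(n-i)\Gamma_{\sigma(i)}} \cdot \Pr[B_\Gamma], \]

where $\Pr[B_\Gamma]$ is the probability of the window-length vector $\Gamma$. Let $\sigma(\Gamma) : \mathbb{N}^n \to \mathbb{N}^n$ be the operation mapping $\Gamma$ to $\Gamma'$ with entries permuted by $\sigma$. Define the inverse $\sigma^{-1}(\Gamma)$ accordingly. Then note 

\[ \sum_{\Gamma} \prod_{i=1}^{n-1} 2^{-(n-i)\Gamma_{\sigma(i)}} \cdot \Pr[B_\Gamma] = \sum_{\Gamma'} \prod_{i=1}^{n-1} 2^{-(n-i)\Gamma'_i} \cdot \Pr[B_{\sigma^{-1}(\Gamma')}]. \]

But because threads are distributed identically, the distribution of $\Gamma$ is symmetric over any ordering $\sigma$: $\Pr[B_\Gamma] = \prod_{i=1}^{n} \Pr[B_{\Gamma_i}] = \prod_{i=1}^{n} \Pr[B_{\Gamma_{\sigma(i)}}] = \Pr[B_{\sigma(\Gamma)}]$. Hence we may write 

\[ \sum_{\Gamma'} \prod_{i=1}^{n-1} 2^{-(n-i)\Gamma'_i} \cdot \Pr[B_{\sigma^{-1}(\Gamma')}] = \sum_{\Gamma'} \prod_{i=1}^{n-1} 2^{-(n-i)\Gamma'_i} \cdot \Pr[B_{\Gamma'}] = E_\Gamma\Bigl[ \prod_{i=1}^{n-1} 2^{-(n-i)\Gamma_i} \Bigr]. \]

There are $n!$ permutations, hence this proves the claim. □ 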



Proof (Theorem 6.2). First observe that for any $\Gamma$ consisting of two segments, by Corollary 5.2, 

\[ \Pr[A(\Gamma)] = c(2) \cdot 2^{-\binom{3}{2}} \cdot \sum_{\sigma} \prod_{i=1}^{1} 2^{-\Gamma_{\sigma(i)}} = \frac{1}{3} \cdot \bigl(2^{-\Gamma_1} + 2^{-\Gamma_2}\bigr). \]

We then let $\Gamma$ be distributed as the critical windows of two reordered copies of an identical random program. 

\[ \Pr[A] = E_\Gamma[\Pr[A(\Gamma)]] = E_\Gamma\bigl[\tfrac{1}{3} \cdot (2^{-\Gamma_1} + 2^{-\Gamma_2})\bigr] = \tfrac{2}{3} \cdot E_\Gamma[2^{-\Gamma_1}]. \]

Note that $B_\gamma$ is the event that $\gamma$ instructions end up between the critical LD and ST exclusively, yet the critical window includes the critical LD and ST. Hence $\Pr[\Gamma_1 = k] = \Pr[B_{k-2}]$ for $k \ge 2$, and 

\[ E_\Gamma[2^{-\Gamma_1}] = \sum_{k=2}^{\infty} 2^{-k} \cdot \Pr[B_{k-2}]. \]

Sequential Consistency: We first analyze the probability of bug manifestation in sequential consistency, the strictest of all memory models. In sequential consistency, no thread ever reorders. Hence no instructions ever appear between the critical LD and ST in a given thread. Thus 

\[ \Pr[A] = \frac{2}{3} \cdot E_\Gamma[2^{-\Gamma_1}] = \frac{2}{3} \cdot \frac{1}{4} = \frac{1}{6}. \]



Weak Ordering: Under the weakest memory model, instructions have a chance to bubble up regardless of whether the preceding instruction is a LD or ST. For this reason, the final size of the critical window is independent of the original program, as it was in sequential consistency. 

Recall from Theorem 4.1 that under Weak Ordering, 

\[ \Pr[B_t] = \frac{2^{-t}}{3} \]

if $t > 0$, and $\Pr[B_0] = \frac{2}{3}$. Thus 

\[ E[2^{-\Gamma_1}] = \sum_{t=2}^{\infty} \Pr[B_{t-2}] \cdot 2^{-t} = (2/3) \cdot 2^{-2} + \sum_{t=3}^{\infty} \frac{2^{-(t-2)}}{3} \cdot 2^{-t} \]
\[ = 2/12 + \frac{4}{3} \sum_{t=3}^{\infty} 4^{-t} = 1/6 + \frac{4}{3} \cdot \frac{1}{64} \cdot \frac{4}{3} = \frac{7}{36}. \]

Then 

\[ \Pr[A] = \frac{2}{3} \cdot \frac{7}{36} = \frac{7}{54}. \]

Note that the probability of not manifesting has indeed decreased from sequential consistency: the ratio is $(1/6)/(7/54) = 9/7$. Correct behavior is somewhat more likely under sequential consistency. 

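Total Store Order: We take advantage of the symmetry for $n = 2$. We need not characterize the joint distribution of the lengths of two critical windows, because only the starting position of the lower window matters. 

Recall that Theorem 4.1 shows that: 

\[ \Pr[B_0] = \frac{2}{3} \]

and 

\[ \Pr[B_\gamma] = \frac{6}{7} \cdot 4^{-\gamma} + R(\gamma) \cdot 2^{-\gamma} \]

for some $0 \le R(\gamma) \le 2/21$. Hence 

\[ E[2^{-\Gamma_1}] = \sum_{t=2}^{\infty} \Pr[B_{t-2}] \cdot 2^{-t} = (2/3) \cdot 2^{-2} + \sum_{t=3}^{\infty} \Bigl( \frac{6}{7} \cdot 4^{-(t-2)} + R(t-2) \cdot 2^{-(t-2)} \Bigr) \cdot 2^{-t} \]
\[ = (1/6) + \frac{6}{7} \cdot 16 \cdot \sum_{t=3}^{\infty} 8^{-t} + 4 \sum_{t=3}^{\infty} R(t-2)\, 4^{-t} = (1/6) + \frac{3}{98} + 4 \sum_{t=3}^{\infty} R(t-2)\, 4^{-t}. \]

Plugging in $R(t) \ge 0$ gives 

\[ \Pr[A] = \frac{2}{3} \cdot E[2^{-\Gamma_1}] \ge \frac{2}{3} \Bigl( \frac{1}{6} + \frac{3}{98} \Bigr) = 58/441 > 0.1315. \]

Similarly, plugging in $R(t) \le 2/21$ gives 

\[ \Pr[A] = \frac{2}{3} \cdot E[2^{-\Gamma_1}] \le 58/441 + \frac{2}{3} \cdot \Bigl( 4 \cdot \frac{2}{21} \cdot \sum_{t=3}^{\infty} 4^{-t} \Bigr) = 58/441 + 1/189 < 0.1369. \]

We now see that with two threads, the probability of reliable execution under Total Store Order is substantially closer to that of weak ordering (0.1296) than that of sequential consistency (0.1666). □ 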
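
The three two-thread values can be reproduced with exact fractions. The sketch below is illustrative and makes its assumptions explicit: it takes Pr[B_0] = 2/3, the Weak Ordering tail Pr[B_γ] = 2^-γ / 3, and the Total Store Order tail (6/7)·4^-γ + R(γ)·2^-γ with 0 ≤ R(γ) ≤ 2/21, and it uses the closed form of the geometric series instead of truncated sums. All variable names are hypothetical.

```python
from fractions import Fraction

def geom_tail(r):
    """Sum of r**g over g >= 1, for 0 < r < 1."""
    return r / (1 - r)

quarter = Fraction(1, 4)          # 2^-2: the window always contains the LD and the ST
base = Fraction(2, 3) * quarter   # contribution of gamma = 0, with Pr[B_0] = 2/3

# E[2^-Gamma] = sum over gamma >= 0 of Pr[B_gamma] * 2^-(gamma + 2)
e_sc = quarter                                                    # window is always 2
e_wo = base + Fraction(1, 3) * quarter * geom_tail(Fraction(1, 4))
e_tso_lo = base + Fraction(6, 7) * quarter * geom_tail(Fraction(1, 8))
e_tso_hi = e_tso_lo + Fraction(2, 21) * quarter * geom_tail(Fraction(1, 4))

# For n = 2, Pr[A] = (2/3) * E[2^-Gamma]
pr = {name: Fraction(2, 3) * e for name, e in
      [("SC", e_sc), ("WO", e_wo), ("TSO_lo", e_tso_lo), ("TSO_hi", e_tso_hi)]}

if __name__ == "__main__":
    for name, value in pr.items():
        print(name, value, float(value))
```

The Total Store Order interval sits between the other two models but much nearer to Weak Ordering, which matches the closing remark of the proof.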

Proof (Theorem 6.3). We first analyze the probability for Sequential Consistency. As the strongest memory model, the probability of successful execution serves as an upper bound for every other model. This is because the likelihood that the shift process results in an overlap is monotonically increasing in the distribution of critical window size. 

Sequential Consistency: Again, recall from Corollary 5.2 and Theorem 6.1 that 

\[ \Pr[A] = E_\Gamma[\Pr[A(\Gamma)]] = c(n) \cdot 2^{-\binom{n+1}{2}} \cdot n! \cdot E_\Gamma\Bigl[ \prod_{i=1}^{n-1} 2^{-(n-i)\Gamma_i} \Bigr]. \]

Under sequential consistency, $\gamma_{\sigma(i)} = 2$ always. Hence 

\[ \Pr[A] = c(n) \cdot 2^{-\binom{n+1}{2}} \cdot n! \cdot \prod_{i=1}^{n-1} 2^{-2(n-i)} = c(n) \cdot 2^{-\binom{n+1}{2}} \cdot n! \cdot 2^{-2\binom{n}{2}} = 2^{-n^2(3/2 + o(1))}, \]

where the last line follows from Stirling's formula: 

\[ n! = \exp\Bigl( \Bigl[ \frac{\ln(2\pi)}{2} + \frac{\ln n}{2} + (\ln n - 1)\, n \Bigr] (1 + o(1)) \Bigr) = \exp\bigl( (n \ln n)(1 + o(1)) \bigr) = 2^{n^2 \cdot o(1)}. \]

Other Models: Surprisingly, to achieve the same bound for any model, all we need is a lower bound on the probability of generating a small critical window. 

Claim B.2. In every memory model, $\Pr[B_0] \ge \frac{1}{2}$. 

Proof. This can be observed by the fact that no matter the model, the critical LD must move up to have any chance of the critical window growing. But even if the critical LD is allowed to pass the instruction above it, this only occurs with probability 1/2. □ 

The claim has the following consequences. 

\[ \Pr\Bigl[ \bigwedge_{i=1}^{n-1} \Gamma_i = 2 \Bigr] \ge 2^{-(n-1)}, \]

hence 

\[ \Pr\Bigl[ \prod_{i=1}^{n-1} 2^{-(n-i)\Gamma_{\sigma(i)}} = 2^{-2\binom{n}{2}} \Bigr] \ge 2^{-(n-1)}, \]

thus 

\[ E\Bigl[ \prod_{i=1}^{n-1} 2^{-(n-i)\Gamma_{\sigma(i)}} \Bigr] \ge 2^{-2\binom{n}{2} - (n-1)}. \]

Plugging this expectation into the probability of correct execution again gives 

\[ \Pr[A] \ge c(n) \cdot 2^{-\binom{n+1}{2}} \cdot n! \cdot 2^{-2\binom{n}{2} - (n-1)} = 2^{-n^2(3/2 + o(1))}. \]

Recall that Sequential Consistency offers the largest probability of correct execution of any model. Hence upper bounding the above value by the probability for Sequential Consistency, we have 

\[ \Pr[A] \le 2^{-n^2(3/2 + o(1))}. \]

This completes the proof. □ 
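
The asymptotic claim can be illustrated numerically for Sequential Consistency. This is a rough sketch, assuming the expression c(n)·2^-C(n+1,2)·n!·2^(-2·C(n,2)) with c(n) = 2 / ∏_{k=2..n} (1 - 2^-k); the function name is hypothetical and everything is computed in log space to avoid underflow.

```python
from math import lgamma, log, log2

def log2_prob_sc(n):
    """log2 of c(n) * 2^-C(n+1,2) * n! * 2^(-2*C(n,2)) for n threads."""
    # c(n) = 2 / prod_{k=2..n} (1 - 2^-k); for large k the factor rounds to 1.
    log2_c = 1 - sum(log2(1 - 2.0 ** -k) for k in range(2, n + 1))
    log2_factorial = lgamma(n + 1) / log(2)
    return log2_c - n * (n + 1) / 2 + log2_factorial - n * (n - 1)

if __name__ == "__main__":
    for n in (2, 10, 100, 1000):
        print(n, log2_prob_sc(n) / n ** 2)
```

The ratio of the exponent to n² drifts towards -3/2 as n grows; the n! factor only contributes about n·log2(n) bits, which is the lower-order term absorbed by the o(1).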



